ABSTRACT


Using  the  maximum  likelihood  technique  an  algorithm  is  developed 
for  the  extraction  of  pitch  for  speech  that  has  been  corrupted  by  additive 
noise.  The  speech  model  includes  the  effects  of  pitch  periodicity  and 
the  spectral  envelope  which  results  in  a  processing  structure  that 
consists  of  a  noise  suppression  prefilter  in  cascade  with  a  comb  filter 
bank  estimator-correlator.  The  prefilter  attenuates  those  frequency 
bands  where  the  speech  signal-to-noise  ratio  is  low,  hence  most  of  the 
deleterious  noise  is  rejected  prior  to  the  determination  of  pitch  by  the 
comb filter bank correlator. The comb filter interpretation leads to an
implementation of the correlation function that avoids anomalous pitch
errors caused by windowing and formant sidelobe interaction, and thereby
obviates the need for any type of spectral flattening. Pitch ambiguities
are resolved using a majority logic
scoring  algorithm  and  a  carefully  designed  pitch  tracker  that  can  adapt 
rapidly  to  gross  pitch  variations.  The  voiced/unvoiced  decision  is 
based  on  an  adaptive  minimum  energy  threshold,  a  high/low  band  energy 
measurement,  a  normalized  pitch  correlation  coefficient  and  a  pitch  track 
continuity coefficient. A time domain implementation of the algorithm
runs in real time in conjunction with an LPC spectral analysis/synthesis
system operating at 2.4 kb/s or 3.6 kb/s.

I. INTRODUCTION


With  the  development  of  high-speed  minicomputer  voice  terminals  it 
has  become  possible  to  deploy  low  bit-rate  speech  encoding  algorithms  in 
real-world  operational  environments.  In  some  of  these  applications  the 
speech  is  corrupted  by  an  additive  acoustical  background  noise  which 
oftentimes  results  in  a  significant  reduction  in  intelligibility  [1]. 

This  has  stimulated  research  into  the  investigation  of  more  robust 
algorithms  for  estimating  the  speech  parameters.  In  this  paper  the 
focus  is  on  the  development  of  a  robust  pitch  extractor  based  on  the 
maximum likelihood technique. While this approach has already been
explored by several authors [2] [3] [4], none of the models on which the
analyses were based have taken into account the effects of the spectral
envelope, and as a result pitch periodicity was the only discriminant
that was used to combat the noise. When the envelope structure is
included in the basic model, as is done in this paper, the resulting
analysis leads to an algorithm that consists of a noise suppression
prefilter in cascade with a comb filter bank correlator. The prefilter
attempts to attenuate those frequency bands where the speech
signal-to-noise ratio is low, hence most of the deleterious noise is rejected
prior to the determination of pitch by the comb filter bank correlator.

The comb filter interpretation leads to an implementation of the correlation
function that avoids anomalous pitch errors caused by windowing and formant
sidelobe interaction, and thereby obviates the need for any type of spectral
flattening prior to pitch estimation. Pitch ambiguities are resolved using a
majority logic scoring algorithm and a carefully designed pitch tracker that
can adapt rapidly to legitimate trailing edge pitch doubles. The
voiced/unvoiced decision is based on an adaptive minimum energy threshold, a
high/low band energy measurement, a normalized pitch correlation coefficient
and a pitch track continuity coefficient. The paper describes a time domain
implementation of the algorithm that runs in real time in conjunction with an
LPC spectral analysis/synthesis system operating at 2.4 kb/s or 3.6 kb/s.

In Section II the pitch estimation problem is formulated within the
framework of statistical estimation theory, and a sufficient statistic
for the pitch estimator is derived. The ambiguity resolution logic and
the design of the pitch tracker are presented in Section III, and the
rationale for the buzz-hiss detector is discussed in Section IV. The
algorithm is currently being subjected to extensive testing using the
Diagnostic Rhyme Test (DRT) for clean speech and for speech that has been
corrupted by additive E4A Advanced Airborne Command Post (ABCP) noise.

II.  DERIVATION  OF  THE  SUFFICIENT  STATISTIC 

Based  on  a  set  of  noisy  observations  of  a  voiced  speech  waveform  it  is 
desired  to  determine  the  "best"  estimate  of  the  pitch  period.  The  maximum 
likelihood  estimator  is  selected  since  it  is  easy  to  compute  and  for  large 
signal-to-noise  ratio  (SNR),  it  is  asymptotically  unbiased  and  efficient 
(the variance converges to zero). The estimate is based on the data

$$ y_n = v_n + w_n, \qquad n = 1, 2, \ldots, N \tag{1} $$

where $v_n$ and $w_n$ represent the n'th sample of the voiced speech and noise
waveforms respectively. To begin with, the acoustic interference is
assumed to be zero mean, white Gaussian noise with variance $\sigma_w^2$. Once
the estimator has been derived for this case, the generalization to
colored noise follows immediately using the analysis technique of the
prewhitening filter. The voiced speech waveform is modelled as a sample
function of a zero mean, Gaussian, quasi-periodic random process having
covariance function $R_v(k) = R_v(k+\tau)$, where $\tau$ is the period of the
process. This means that almost every sample function is periodic [5]. The
likelihood function is the probability density function for the
observations, which is

$$ \ell(\tau) = p(y_1, y_2, \ldots, y_N \mid \tau) \tag{2} $$

Schweppe [6] has shown that for stationary Gaussian processes the
log-likelihood ratio is

$$ L(\tau) = -\frac{N}{2}\ln\sigma_p^2 - \frac{1}{2\sigma_p^2}\sum_{n=1}^{N}\bigl(y_n - \hat{v}_{n|n-1}\bigr)^2 \tag{3} $$

where $\hat{v}_{n|n-1}$ is the minimum mean squared error prediction of $v_n$ based on
measurements up to time n-1, namely $y_1, y_2, \ldots, y_{n-1}$, and $\sigma_p^2$ is the
prediction error variance obtained by averaging over the ensemble of
speech and noise sample functions. In practice the background noise
level is unknown, which renders $\sigma_p^2$ a nuisance parameter. The maximum
likelihood estimate of $\sigma_p^2$ is found by maximizing (3) and is*

$$ \hat{\sigma}_p^2 = \frac{1}{N}\sum_{n=1}^{N}\bigl(y_n - \hat{v}_{n|n-1}\bigr)^2 \tag{4} $$

from which the log-likelihood function reduces to

$$ L(\tau) = -\ln\bigl[\hat{\sigma}_p^2(\tau)\bigr] \tag{5} $$
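
The step from (3) to (5) can be spelled out as follows. This is a sketch that assumes (3) has the standard Gaussian form, with additive constants dropped; the scale factor N/2 and the constant term do not affect the location of the maximum over the pitch period.

```latex
\begin{aligned}
\frac{\partial L}{\partial \sigma_p^2}
  &= -\frac{N}{2\sigma_p^2}
     + \frac{1}{2\sigma_p^4}\sum_{n=1}^{N}\bigl(y_n - \hat{v}_{n|n-1}\bigr)^2 = 0
  \;\Longrightarrow\;
  \hat{\sigma}_p^2 = \frac{1}{N}\sum_{n=1}^{N}\bigl(y_n - \hat{v}_{n|n-1}\bigr)^2 , \\
L(\tau)\Big|_{\sigma_p^2 = \hat{\sigma}_p^2}
  &= -\frac{N}{2}\ln\hat{\sigma}_p^2(\tau) - \frac{N}{2} .
\end{aligned}
```

Maximizing the last expression over the pitch period is therefore equivalent to maximizing the negative logarithm of the residual variance, that is, to minimizing the energy in the prediction residual.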

Therefore the maximum likelihood estimate of the pitch period
(denoted as $\hat{\tau}$) can be found by choosing $\tau$ to minimize the energy in the
prediction residual. In order to avoid the issue of explicitly implementing
the predictor, it suffices to recognize that, as a consequence of the
Innovations Theorem [7] and the fact that speech is a Markov process,
the residual sequence $e_n = y_n - \hat{v}_{n|n-1}$ is zero mean white noise when $\tau$
is "close to" the true pitch period. As a result, the transformation
from the input sequence $\{y_n\}$ to the sequence of residuals $\{e_n\}$ is a
linear whitening filter. This provides a necessary condition which is
used in the appendix to specify the structure of the maximum likelihood
estimator. It is shown that the first stage of processing is a noise
suppression prefilter. This is specified by the transfer function $P(\omega)$,
which must satisfy

$$ |P(\omega)|^2 = \frac{E(\omega)}{E(\omega) + \sigma_w^2} \tag{6} $$

where $E(\omega)$ is the spectral envelope of the voiced speech (the Fourier
transform of the correlation function $R_v(k)$). The action of the filter
is to suppress those frequencies where the SNR is low and to pass those
frequencies where the SNR is high. Of course, specification of $P(\omega)$
requires knowledge of the speech spectral envelope $E(\omega)$, which is not
available a priori. Techniques for estimating $E(\omega)$ from noisy data
(which is not necessarily white) and for implementing $P(\omega)$ have been
developed by McAulay and Malpass [8].

*$\hat{v}_{n|n-1}$ depends implicitly on the pitch period $\tau$.
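
The behaviour of the prefilter can be illustrated numerically. The sketch below assumes the gain has the form |P(w)|^2 = E(w) / (E(w) + noise variance); the function name `prefilter_gain` and the envelope values are hypothetical and chosen only to show the two regimes. It does not attempt the envelope estimation of [8].

```python
import numpy as np

def prefilter_gain(envelope, noise_var):
    """Return |P(w)| under the assumption |P(w)|^2 = E(w) / (E(w) + noise_var)."""
    envelope = np.asarray(envelope, dtype=float)
    return np.sqrt(envelope / (envelope + noise_var))

# Hypothetical envelope samples: a strong band, a band at 0 dB SNR, a weak band.
gains = prefilter_gain([100.0, 1.0, 0.01], noise_var=1.0)
```

Bands where the envelope dominates the noise pass almost unchanged, while bands where the noise dominates are strongly attenuated, which is the qualitative behaviour described in the text.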

Having used the spectral envelope information to enhance the noisy
speech waveform, the next stage of processing exploits the voiced speech
periodicity to determine the pitch using correlation techniques. If
$Y(\omega)$ represents the discrete Fourier transform (DFT) of the prefilter
input, and

$$ X(\omega) = P(\omega)\,Y(\omega) \tag{7} $$

represents the DFT of the prefilter output, then it is also shown in the
appendix that the pitch likelihood function reduces to


where
represents the output of a comb filter tuned to pitch period $\tau$. The
action of this filter is to produce an estimate of the waveform $x_n$,
presuming its periodicity to have period $\tau$. A bank of comb filters is
needed, one for each possible pitch candidate in the range of interest
(i.e., 40 Hz - 300 Hz), and the pitch corresponding to the comb filter that
leads to the largest correlation determines the maximum likelihood
estimate of the pitch period. A block diagram illustrating the signal
processing requirements to compute $\ell(\tau)$ is shown in Figure 1.
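
The estimator-correlator idea can be sketched in a few lines. The example assumes the comb filter output for a trial period is the average of the samples one period behind and one period ahead, so that the statistic is proportional to the sum of x[n] * (x[n - tau] + x[n + tau]); the factor 1/2, the function name `pitch_likelihood`, and the synthetic test signal are assumptions made for illustration.

```python
import numpy as np

def pitch_likelihood(x, start, n_win, lags):
    """Correlate x[n] with the comb-filter estimate 0.5 * (x[n-tau] + x[n+tau])."""
    n = np.arange(start, start + n_win)
    scores = []
    for tau in lags:
        x_hat = 0.5 * (x[n - tau] + x[n + tau])   # comb filter tuned to tau
        scores.append(float(np.dot(x[n], x_hat)))
    return np.array(scores)

# Synthetic "voiced" signal: a pulse train with a 40-sample period,
# smeared by a short decaying exponential.
period = 40
pulses = np.zeros(600)
pulses[::period] = 1.0
x = np.convolve(pulses, np.exp(-np.arange(20) / 4.0))[:600]

lags = np.arange(20, 61)                  # trial pitch periods in samples
ell = pitch_likelihood(x, start=200, n_win=160, lags=lags)
best_lag = int(lags[np.argmax(ell)])
```

Note that the window indices must leave room for the largest lag on both sides, which is the buffering requirement of this structure.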

Substitution of (9) into (8) would show that, except for the prefilter,
the maximum likelihood pitch estimator is basically another
version of the correlation function pitch extractor [9], [10]. However,
the standard implementation of this technique is to choose a processing
interval that is wide enough to include two or three periods of the
speech waveform. The autocorrelation function of the windowed data is
then computed. The effect of this is to apply a triangular window to
the true correlation function, which causes the sidelobes introduced by
the formant resonance to result in peaks that can be larger than the
peak at the true pitch period, especially if the formant bandwidth is
narrow. This, in turn, leads to large anomalous pitch errors. It has
been the desire to remove the formant interaction that has led to the
use of spectral flattening [9], [10] prior to the computation of the
autocorrelation function. However, this not only complicates the
processing but also leads to an undesirable reduction in the SNR. These
problems have been eliminated completely by interpreting (9) as a comb
filter which operates continuously on the prefilter output waveform to
produce the estimate of the voiced speech waveform $x_n$. Therefore use of
the maximum likelihood method has led to a signal processing interpretation
of the correlation operation in terms of an estimator-correlator structure
which, although a seemingly trivial detail, has a profound effect on the
properties of the pitch likelihood function. The cost of these benefits is
the buffering that must be provided for $x_0, x_{-1}, \ldots, x_{-T}$ and
$x_{N+1}, x_{N+2}, \ldots, x_{N+T}$,
where T is the largest pitch period of interest. For this implementation
the correlation window size, N, only needs to be wide enough to include
at least one pitch period, although in practice it is convenient to use
the same width as the spectral analysis frame size.

In  general  it  is  desired  to  evaluate  (8)  using  the  full  bandwidth 
speech  (3787.5  Hz  in  this  implementation)  since  the  likelihood  function 
can  profitably  extract  the  harmonic  structure  wherever  it  is  available 
in  the  frequency  band.  This  is  especially  important  in  the  noisy  speech 
problem  where  low  frequency  noise  may  cause  the  prefilter  to  attenuate 
the  harmonics  near  the  first  formant.  In  an  attempt  to  cover  the  pitch 
range  40  -  300  Hz,  the  real  time  capabilities  of  the  LDVT  [11]  were 
exceeded, hence it was necessary to restrict the bandwidth to 1/4 of the
sampling rate (the sampling rate being 7575 Hz in this implementation) so
that the input speech (the prefilter output) could be downsampled by 2:1.
Since this bandwidth restriction also applied to the likelihood function, it
sufficed to evaluate $\ell(\tau)$ at every other pitch sample. The missing
values were then obtained by using a 2:1 up-sampling filter [12].

A typical example of the likelihood function that was computed
subject to the preceding conditions is shown in Figure 2a for a male
speaker for the vowel /u/ as in chew. Another example, in Figure 2b,
computed for a female speaker for the vowel /æ/ as in that, shows more
clearly the effect of a narrow formant bandwidth. In spite of the
multiplicity of peaks at the formant frequency, the peak corresponding
to the true pitch period is evident. This has always been found to be
the case for the large number of utterances that have been examined. It
is obvious that if the autocorrelation function had been computed in the
usual way, the triangular window would have totally obscured the peak at
the true pitch. While these steady-state results could have been obtained
by computing either $\sum_n x_n x_{n-\tau}$ or $\sum_n x_n x_{n+\tau}$, it was found that using
$\sum_n x_n (x_{n-\tau} + x_{n+\tau})$, as required by (8) and (9), resulted in a more stable
likelihood function at the leading and trailing edges of a phoneme and
during pitch and vowel transitions.

Although the proposed implementation has eliminated the problem of
formant interaction without using spectral flattening, it has resulted
in the introduction of ambiguous peaks. In Figure 2a, for example, the
peaks at 9.9 ms and 14.9 ms are as well-defined as the peak at the true
pitch of 4.88 ms. Since minor perturbations in these peak values will
occur, a simple peak picking algorithm will lead to pitch doubling and
pitch tripling errors. Since there is nothing that can be done to avoid
this problem using only a single frame of data, heuristics must be
introduced using data from multiple frames to resolve the ambiguous
peaks. This issue is discussed in the next section.

III. PITCH AMBIGUITY RESOLUTION

The algorithm for resolving the pitch ambiguities is based on the
majority logic decision scheme developed by Gold [13] [14]. To begin
with, a set of five elementary pitch estimators is constructed by searching
the likelihood function for the largest peak lying in each of the
unambiguous intervals 3-5.8 ms, 5.8-8.84 ms, 8.84-11.88 ms, 11.88-14.92 ms,
and 14.92-25 ms, where a, b, c, d, e refer to the corresponding pitch
periods at which each peak occurs. If an interval contains no peak (the
slope must change sign for a peak to be defined), the pitch period is
set equal to zero. A pitch tracker is presumed to exist (this will be
described subsequently) which is specified in terms of a slow average
pitch, PSLOW, a fast average pitch, PFAST, a lower tracker limit, TLO,
and an upper tracker limit, THI. The largest peak that falls within the
tracker window [TLO, THI] is labelled TRKMAX, while the pitch period at
which this peak occurs is labelled TRKTAU. The largest of all of the
peaks is labelled GBLMAX. If any of the five peaks lies below 0.85*GBLMAX
then the corresponding pitch period is set to zero. A better feeling
for these definitions can be obtained by referring to Figure 2, where all
of the quantities have been labelled.
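
The five elementary estimators can be sketched as a peak search over fixed intervals. The function name `elementary_estimates`, the peak test (a sample larger than its left neighbour and no smaller than its right neighbour), and the use of half-open intervals are assumptions; the interval edges and the 0.85*GBLMAX rule are taken from the text.

```python
import numpy as np

# Interval edges in ms, as quoted in the text.
INTERVALS_MS = [(3.0, 5.8), (5.8, 8.84), (8.84, 11.88), (11.88, 14.92), (14.92, 25.0)]

def elementary_estimates(tau_ms, ell, rel_thresh=0.85):
    """Return (periods a..e, GBLMAX); a period of 0.0 means 'no usable peak'."""
    tau_ms = np.asarray(tau_ms, dtype=float)
    ell = np.asarray(ell, dtype=float)
    # Interior local maxima: the slope changes sign from positive to negative.
    is_peak = np.zeros(len(ell), dtype=bool)
    is_peak[1:-1] = (ell[1:-1] > ell[:-2]) & (ell[1:-1] >= ell[2:])
    periods, heights = [], []
    for lo, hi in INTERVALS_MS:
        idx = np.flatnonzero(is_peak & (tau_ms >= lo) & (tau_ms < hi))
        if idx.size == 0:
            periods.append(0.0)
            heights.append(-np.inf)
        else:
            k = idx[np.argmax(ell[idx])]      # largest peak in this interval
            periods.append(float(tau_ms[k]))
            heights.append(float(ell[k]))
    gblmax = max(heights)
    # Peaks below rel_thresh * GBLMAX are discarded (period set to zero).
    periods = [p if h >= rel_thresh * gblmax else 0.0
               for p, h in zip(periods, heights)]
    return periods, gblmax
```

The tracker quantities TRKMAX and TRKTAU would be obtained the same way, with the search restricted to the window [TLO, THI].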


The next step is to associate two pitch period estimates with the
output of each of the elementary pitch estimators: that for frame m
(the current frame) and that for frame m-1 (the previous frame). These are
labelled a, a', b, b', ..., e, e'. Of the two, only the most recent period
is a candidate for the final pitch estimate (a, b, c, d or e); the other
quantities are used only for scoring purposes. Figure 3 summarises the
scoring strategy as obtained for two frames of the vowel /u/, where
Figure 2a represents the data for frame m. Each of the five current
pitch candidates is compared against 25 quantities, itself included.
Ten of the 25 are the two frame measurements from each of the five
elemental pitch estimators. An additional 14 checks are obtained by
computing the pitch values b/2, c/2, d/2, e/2, c/3, d/3, e/3, b'/2,
c'/2, d'/2, e'/2, c'/3, d'/3, e'/3, which account for the presence of
likelihood function peaks at double and triple the true pitch, as was
the case in the examples in Figure 2. The final check is with respect
to the fast average pitch, PFAST, as this takes into account the longer
term properties of the pitch estimates computed over several frames. As
indicated in Figure 3, a pitch candidate receives a vote if the test
period against which it is compared is within a given percentage of the
candidate value. The value of W (Figure 3) was chosen to be 1/8. The
candidate that receives the highest score is taken as the trial pitch
estimate for frame m and is labelled LASTPT. The corresponding value of
the likelihood function is stored as LASTMX.

Fig. 3. Scoring algorithm for pitch ambiguity resolution

Finally, if the ambiguity resolved pitch estimate LASTPT lies within the
tracker window [TLO, THI], then the pitch estimate for frame m, P(m), is set
equal to LASTPT; otherwise P(m) is taken to be the tracker pitch estimate
TRKTAU. Usually the former step is taken, but sometimes, especially during
vowel transitions, the tracker estimate takes precedence, thereby maintaining
the continuity of the pitch track. To illustrate the concepts more clearly,
the algorithm when applied to the likelihood function in Figure 2a results
in the following candidates for the pitch estimates: a=4.88, b=0, c=9.9,
d=0, e=14.9. From Figure 3, candidate "a" with a score of 7 will be the
choice for the unambiguous pitch estimate. Hence LASTPT=4.88, and since
TLO=3.09 and THI=7.21, the tracker window encloses LASTPT and hence the
pitch estimate for the m'th frame is P(m)=4.88 which, as is most often
the case, is the same as TRKTAU. This is a result of the correctly placed
tracker window. That this happens most of the time is due to the design of
the pitch tracker. The tracker limits are set


according to the rule

TLO(m) = 0.6*PFAST    (10a)

THI(m) = 1.4*PFAST    (10b)

where the "fast" and "slow" pitch averages evolve according to the
recursions

PFAST(m) = PFAST(m-1) + QFAST*[LASTPT - PFAST(m-1)]    (11a)

PSLOW(m) = PSLOW(m-1) + QSLOW*[LASTPT - PSLOW(m-1)]    (11b)


where  QFAST  and  QSLOW  control  the  time  constants  of  the  "fast"  and 
"slow"  averaging  filters.  In  the  real  time  implementation  frames  were 
processed  every  21  ms,  hence  PSLOW  represented  a  long  term  estimate  of 
the  pitch  for  a  particular  speaker  by  setting  QSLOW  =  .03  (.69  sec  time 
constant).  Should  a  new  speaker  having  a  radically  different  pitch  use 
the  same  vocoder,  then  PSLOW  is  guaranteed  to  adapt  to  the  new  speaker 
since  it  tracks  the  unambiguous  pitch  estimate  LASTPT  which  is  not 
influenced  in  any  way  by  the  previous  tracker  settings.  Although  it  is 
tempting  to  set  up  a  tracker  window  about  this  long  term  average  value, 
it  has  been  found  that  for  some  speakers  wide  fluctuations  in  pitch  can 
occur  within  a  given  utterance  which  demands  more  dynamic  adaptivity  in 
setting  the  tracker  limits.  It  is  for  this  reason  that  TLO  and  THI  are 
tied  to  the  fast  average  pitch  PFAST  which  is  up-dated  using  a  much 
shorter  time  constant  by  setting  QFAST  =  .35  (49  ms  time  constant).  As 
a  consequence  PFAST  represents  an  estimate  of  the  short  term  average 
pitch,  hence  it  can  be  used  as  a  useful  input  to  the  scoring  table  for 
pitch  ambiguity  resolution  as  well  as  producing  tracker  limits  that 
adapt  more  quickly  to  rapid  pitch  variations.  An  extreme,  although  not 
uncommon case, is illustrated in Figure 4, which shows the pitch track
and the tracking parameters for the utterance "I heard that shot echo"
spoken by a female. At the end of the vowel /o/ in echo the pitch period
doubles. However, after two frames the upper pitch tracking limit
adapted to a value that was large enough to include the true pitch
period.
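
Returning to the majority-logic vote described earlier in this section, the scoring of the five current candidates against the 25 test quantities can be sketched as follows. The function names, the treatment of zero (missing) periods as never matching, and the previous-frame values used in the usage note are assumptions; the list of test quantities and W = 1/8 follow the text.

```python
def within(candidate, test, w=1.0 / 8.0):
    """True when a nonzero test period lies within w * candidate of the candidate."""
    return test > 0.0 and abs(candidate - test) <= w * candidate

def score_candidates(current, previous, pfast, w=1.0 / 8.0):
    """current / previous hold the five periods a..e and a'..e' (0.0 = no peak)."""
    tests = list(current) + list(previous)            # 10 direct measurements
    for frame in (current, previous):
        tests += [p / 2.0 for p in frame[1:]]         # b/2, c/2, d/2, e/2
        tests += [p / 3.0 for p in frame[2:]]         # c/3, d/3, e/3
    tests.append(pfast)                               # fast average pitch
    scores = [sum(within(c, t, w) for t in tests) if c > 0.0 else 0
              for c in current]
    best = max(range(len(current)), key=lambda i: scores[i])
    return current[best], scores

# Current frame from the worked example in the text; the previous-frame
# periods and PFAST below are made-up values for illustration only.
lastpt, scores = score_candidates(
    current=[4.88, 0.0, 9.9, 0.0, 14.9],
    previous=[4.9, 0.0, 9.8, 0.0, 14.8],
    pfast=5.15,
)
```

With these illustrative inputs the candidate at 4.88 ms collects votes from itself, the previous frame, the halved and thirded longer periods, and PFAST, so it outscores the candidates at 9.9 ms and 14.9 ms.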


[Figure 4: word boundaries I | HEARD | THAT | SHOT | ECHO; horizontal axis: frame number (x 21 msec)]

Fig. 4. Typical trajectories for the pitch tracker parameters


While the fast adaptation capabilities are needed to track the
increasing pitch period at the trailing edge of an utterance, secondary
problems can arise if the lower tracker limit adapts according to (10)
without constraint. For example, if the female speaker who uttered the
phrase "I heard that shot echo" had continued speaking, it is most
likely that the pitch period would have returned to a value near the
long term average of 4.62 ms. However, because of the trailing edge
pitch double, the short term average pitch was 10.43 ms, and from (10)
the lower tracker limit would have been 6.25 ms, which would have blocked
out subsequent pitch period estimates. Since PSLOW represents the long
term average pitch, the lock-out problem can be avoided by requiring
that

TLO <= 0.6*PSLOW    (12)

For the example PSLOW was 4.62 ms at the end of the utterance, hence TLO
was constrained to be no greater than 2.77 ms, which completely eliminated
the lock-out problem.

Since  it  is  also  possible  for  the  pitch  period  to  decrease  rapidly 
during  certain  inflections  (this  seems  to  be  a  less  frequent  event)  a 
similar  lock-out  problem  exists  with  respect  to  the  upper  tracker  limit. 
Hence  the  constraint 

1.4*PSLOW <= THI    (13)

is  used  to  eliminate  the  possibility  that  this  lock-out  event  can  occur. 
As a final precaution that pitch inflections of the type described
will not cause the pitch tracker to settle on an inappropriate orientation
during unvoicing, the fast tracker parameters are relaxed to the
steady-state values

TLO = 0.6*PSLOW    (14)

PFAST = PSLOW    (15)

THI = 1.4*PSLOW    (16)

To accomplish this, PFAST is adapted by driving recursion (11a) with
PSLOW as the excitation and computing TLO and THI from (10) subject to
constraints (12) and (13) whenever a frame of unvoiced speech is detected.
The slow tracker average is not altered during such an unvoiced
classification. The effect of the tracker relaxation is illustrated in
Figure 4 during the silent interval at the end of the utterance.
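
The tracker recursions, the constrained window, and the unvoiced relaxation can be collected into one small sketch. The class name `PitchTracker`, its method names, and the helper `time_constant_ms` are hypothetical; the gains, the 21 ms frame spacing, and the factors 0.6 and 1.4 are taken from the text. The time constant is computed as the 1/e decay time of the first-order average, which is an assumption about how the quoted figures were obtained.

```python
import math
from dataclasses import dataclass

QFAST, QSLOW = 0.35, 0.03      # averaging gains quoted in the text
FRAME_MS = 21.0                # frame spacing quoted in the text

@dataclass
class PitchTracker:
    pfast: float
    pslow: float

    def limits(self):
        """Tracker window from (10), clamped by constraints (12) and (13)."""
        tlo = min(0.6 * self.pfast, 0.6 * self.pslow)
        thi = max(1.4 * self.pfast, 1.4 * self.pslow)
        return tlo, thi

    def update(self, lastpt, voiced=True):
        """One frame of (11a)/(11b); unvoiced frames relax PFAST toward PSLOW."""
        if voiced:
            self.pfast += QFAST * (lastpt - self.pfast)
            self.pslow += QSLOW * (lastpt - self.pslow)
        else:
            # Recursion (11a) driven by PSLOW; PSLOW itself is left unchanged.
            self.pfast += QFAST * (self.pslow - self.pfast)
        return self.limits()

def time_constant_ms(q, frame_ms=FRAME_MS):
    """Time for the first-order average to decay by a factor of 1/e."""
    return -frame_ms / math.log(1.0 - q)

# State at the end of the "echo" example: PFAST pulled up by the pitch double.
tracker = PitchTracker(pfast=10.43, pslow=4.62)
tlo, thi = tracker.limits()
```

For the example state the lower limit is held at 0.6 times PSLOW (about 2.77 ms) rather than 0.6 times PFAST, and `time_constant_ms` gives roughly 49 ms for QFAST and 0.69 s for QSLOW, in line with the values quoted in the text.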

In the preceding discussion use has been made of a buzz-hiss detector
to determine when to update the tracker limits. The voicing
detection algorithm will be discussed in the next section.

IV.  VOICED-UNVOICED  SPEECH  DETECTION 

The  buzz-hiss  detector  is  probably  the  most  critical  component  of  a 
narrowband  vocoding  system  since  it  not  only  impacts  significantly  on 
intelligibility  but  has  a  profound  effect  on  user  acceptability.  While 
buzz-hiss algorithms have been developed that work well on clean speech,
problems often arise when the same algorithms are applied to speech
which has been distorted by additive noise. This is especially true
for noise sources having a well-defined spectral characteristic at low
frequencies, such as E4A advanced airborne command post noise, since the
noise waveform has many of the attributes of voiced speech. Since the
maximum  likelihood  pitch  estimator  uses  a  noise  suppression  prefilter  to 
enhance  the  signal-to-noise  ratio  prior  to  pitch  correlation,  it  is 
reasonable  to  design  the  buzz-hiss  logic  to  deal  with  speech  contaminated 
by  a  low  level  residual  noise.  No  single  discriminant  has  been  found 
that  represents  a  necessary  and  sufficient  condition  for  voiced  speech, 
hence  a  number  of  tests  were  used  in  sequence.  These  tests  were: 

1.  minimum  energy  threshold 

2.  high/low band energy measurements

3.  pitch  correlation  coefficient 

4.  pitch  track  continuity. 

A  flow  chart  for  the  voicing  logic  is  shown  in  Figure  5  and  will  now  be 
described  in  detail. 

Test 1: In the first test the energy for one frame of prefiltered
speech is computed in block floating point format for the m'th frame as

$$ e(m) = \Bigl(\sum_{n=1}^{N} v_n^2\Bigr)\, 2^{-\mathrm{SCLFCT}} \tag{17} $$

where $v_n$ is the prefilter output* and SCLFCT = 1, 2, ... if an overflow
occurs. If $\mu(m-1)$ represents an estimate of the background noise energy
computed on frame m-1, then an unvoiced or silence speech classification
is made if SCLFCT = 0 and $e(m) - \mu(m-1) < 2^{-8}$ ($v_n$ is treated as a 16 bit
fraction). The description of the computation of $\mu(m-1)$ will be deferred
until later in this section; however, for clean speech $\mu(m-1)$ will be
zero.

Test 2: Once the energy test has been passed, a simple check is
made  on  the  structure  of  the  correlation  function.  For  all  voiced 
sounds  the  correlation  function  has  at  least  one  zero  crossing.  While 
this  is  also  the  case  for  most  unvoiced  sounds,  it  sometimes  happens 
that  for  low  level  noise  and  some  unvoiced  sounds  such  a  zero  crossing 
does  not  occur,  and  these  cases  are  immediately  classified  as  unvoiced 
speech.  The  test  is  implemented  by  determining  if  the  minimum  value  of 
the  pitch  likelihood  function  (labelled  GBLMIN)  is  positive. 

Test 3: The next test measures the high/low spectral balance of
the speech energy. Since energy in unvoiced sounds is generally located
at high frequencies, measuring the spectral energy balance is a powerful
voicing discriminant. If $L(\omega)$ represents a low pass filter and $H(\omega)$ a
high pass filter, then

$$ E_L = \int_0^{\pi} |L(\omega)|^2\,|Y(\omega)|^2\,d\omega \tag{18} $$

$$ E_H = \int_0^{\pi} |H(\omega)|^2\,|Y(\omega)|^2\,d\omega \tag{19} $$

measure the low and high pass energies of the wideband speech signal $y_n$.
The detection logic is to

(a) declare voiced if $E_L/E_H > \lambda_v$    (20)

(b) declare unvoiced if $E_L/E_H < \lambda_u$    (21)

The voicing decision is deferred whenever $\lambda_u < E_L/E_H < \lambda_v$. The threshold
settings depend on the exact filters used in (18) and (19), but a
particularly convenient choice is to take

$$ |L(\omega)|^2 = \begin{cases} \cos\omega, & 0 \le \omega \le \pi/2 \\ 0, & \pi/2 < \omega \le \pi \end{cases} \tag{22} $$

$$ |H(\omega)|^2 = \begin{cases} 0, & 0 \le \omega \le \pi/2 \\ |\cos\omega|, & \pi/2 < \omega \le \pi \end{cases} \tag{23} $$

*In the real time program $v_n$ is measured at the output of the 2:1
downsampling filter.
As a consequence of these definitions, the difference between the low
and high energies is given by

$$ E_L - E_H = r_1 \tag{24} $$

where $r_1$ is the measured correlation function at unit delay. Since the
total energy in the band is

$$ r_0 = \sum_{n=1}^{N} y_n^2 \tag{25} $$

which is the measured correlation function at zero delay, and since the
total energy is approximately equal to the sum of the energies out of
the low and high pass filters, then

$$ r_0 \approx E_L + E_H \tag{26} $$

Combining (24) and (26) with (20) and (21) results in the detection logic

(a) declare voiced if $\dfrac{r_1}{r_0} > \dfrac{\lambda_v - 1}{\lambda_v + 1}$    (27a)

(b) declare unvoiced if $\dfrac{r_1}{r_0} < \dfrac{\lambda_u - 1}{\lambda_u + 1}$    (27b)

which is a well-known buzz-hiss discriminant. However, the present
derivation relates the $r_1/r_0$ thresholds to the thresholds for the low
and high pass energies. For speech that has not been preemphasized*, it
has been found that a sufficient condition for unvoiced speech
classification is achieved with an $r_1/r_0$ threshold of -0.2, which
corresponds to the requirement that
the high pass energy be 50% larger than the low pass energy (i.e.,
$\lambda_u = 2/3$). For voiced speech it has been found necessary to require
a threshold of 0.97, which corresponds to the condition that the low pass
energy be 66 times the high pass energy (i.e., $\lambda_v = 66$). While this
seems overly conservative, any smaller value has been found to lead to
many erroneous unvoiced-to-voiced classifications, particularly for the
plosives. Therefore the test in (27a) is mainly used to obtain correct
voiced speech classifications of the nasals.
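
The correspondence between the two kinds of threshold can be checked with a few lines of arithmetic. The sketch assumes the relations stated in the text, namely that r1 equals the low pass energy minus the high pass energy and that r0 is approximately their sum, so that r1/r0 = (ratio - 1) / (ratio + 1) where ratio is the low-to-high energy ratio. The function names are hypothetical.

```python
def balance_to_corr(ratio):
    """Map a low/high energy-ratio threshold to the equivalent r1/r0 threshold."""
    return (ratio - 1.0) / (ratio + 1.0)

def corr_to_balance(rho):
    """Inverse map: an r1/r0 threshold back to the low/high energy ratio."""
    return (1.0 + rho) / (1.0 - rho)

unvoiced_rho = balance_to_corr(2.0 / 3.0)   # high band 50% larger than low band
voiced_ratio = corr_to_balance(0.97)        # conservative voiced threshold
relaxed_ratio = corr_to_balance(0.2)        # relaxed threshold of 0.2
```

The unvoiced ratio of 2/3 maps to -0.2, a threshold of 0.97 maps to a ratio of roughly 66, and a threshold of 0.2 maps to a ratio of 1.5, that is, a low band 50% larger than the high band.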

*In the real time implementation the speech undergoes analog preemphasis
in order to enhance the dynamic range of the 12 bit A/D converter and to
precondition the speech for LPC spectral analysis. Therefore a digital
deemphasis filter is used prior to the computation of $r_0$ and $r_1$.

Branch 1: Since the high/low energy is not in itself a complete
test for voicing, a further subdivision in the class of speech events is
obtained by measuring the continuity of the pitch track. Two pitch
continuity coefficients, LUNA and LTRK, are computed, one for the
unambiguous pitch estimate LASTPT and one for the tracker pitch estimate
TRKTAU. Letting PCAND denote the candidate pitch and letting TEST
denote the pitch period against which it is being tested for coincidence
then a pitch correlation occurs if


$$\frac{|\mathrm{PCAND} - \mathrm{TEST}|}{\mathrm{PCAND}} < \frac{1}{8} \tag{28}$$


This definition is applied to the evaluation of the unambiguous pitch
continuity coefficient as follows: If LASTPT(m) does not correlate with
LASTPT(m-1) then LUNA = 0. If LASTPT(m) correlates with LASTPT(m-1)
then LUNA = 1. If LUNA = 1 and LASTPT(m) correlates with LASTPT(m-2)
then LUNA = 2. The tracker pitch continuity coefficient, LTRK, is
computed in the same way, replacing LASTPT(i) by TRKTAU(i), i = m, m-1,
m-2. At the branch point the voicing algorithm declares a broken pitch
track if LUNA = 0 and LTRK = 0, that is if neither of the frame m and
frame m-1 pitch estimates correlate.
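
The continuity coefficients can be sketched as follows. This is an illustrative sketch only: the 1/8 relative tolerance and the strict inequality are assumptions about the form of (28), the current-frame estimate is assumed to play the role of the candidate pitch, and the function names are hypothetical.

```python
def correlates(pcand, test, tolerance=1.0 / 8.0):
    """Pitch correlation test of (28); the tolerance value is an assumption."""
    return abs(pcand - test) / pcand < tolerance


def continuity(p_m, p_m1, p_m2):
    """Continuity coefficient (0, 1 or 2) for the estimates at frames m, m-1, m-2."""
    if not correlates(p_m, p_m1):
        return 0
    if correlates(p_m, p_m2):
        return 2
    return 1


def broken_pitch_track(luna, ltrk):
    """The branch declares a broken track only if both coefficients are zero."""
    return luna == 0 and ltrk == 0


# Hypothetical pitch periods (in samples) for frames m, m-1, m-2.
luna = continuity(80, 82, 81)    # unambiguous estimates LASTPT
ltrk = continuity(80, 120, 81)   # tracker estimates TRKTAU
```

The same function serves for both LUNA and LTRK since, as stated above, the two coefficients differ only in which pitch estimates are supplied.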

Test 4: The principal reason for the test for the broken pitch
track is to eliminate the plosives as possible candidates for voiced
speech classification. A typical example of the likelihood function for
a plosive is illustrated in Fig. 6 which shows that although the peaks
are not insignificant, the largest value occurs at a randomly oriented
pitch which tends to decorrelate with respect to previous values.
Therefore if the branch declares in favor of an unbroken pitch track it
is highly unlikely that the sound is a plosive. This means that the
threshold on the spectral balance measure $r_1/r_0$ can be relaxed and a
weaker threshold is set at .2, which results in a voiced speech
classification whenever the low band energy is 50% larger than the high
band energy.

Fig. 6. The pitch likelihood function for the plosive /p/ as in
"prison."

Test  5:  Since  it  may  happen  that  microphone  distortion  or  prefilter 
effects  could  affect  the  spectral  balance  in  such  a  way  that  the  preceding 
test  for  voiced  speech  might  fail,  as  a  final  precaution  another  test 
for  pitch  continuity  is  made.  In  this  case  if  either  LUNA  =  2  or  LTRK  =  2, 
that  is  if  either  the  unambiguous  pitch  estimate  or  the  tracker  pitch 
estimate  correlates  with  the  corresponding  pitch  estimates  on  the  previous 
two  frames  then  the  pitch  track  is  declared  "smooth"  and  a  voiced  speech 
classification  is  made. 

Test  6:  It  was  argued  that  the  branching  test  for  a  broken  pitch 
track  was  needed  in  the  classification  of  the  plosives.  However  a  pitch 
track  discontinuity  can  also  occur  if  the  pitch  is  in  a  rapid  transition. 
When  the  latter  event  occurs  the  normalized  pitch  likelihood  function  is 
usually  high,  which  is  characteristic  of  voiced  speech,  but  not  of  a 
plosive. Therefore if $\ell(\tau)$ represents the computed likelihood function
at pitch period $\tau$, then its value at the unambiguous pitch was defined
to be LASTMX = $\ell$(LASTPT) and an unvoiced classification is made if

$$\frac{\mathrm{LASTMX}}{e(m)} < \frac{1}{2} \tag{29}$$

This  test  has  the  desired  effect  of  classifying  most  of  the  plosives  as 
unvoiced  speech. 


Test 7: While it is tempting to use (29) as a necessary and sufficient
condition for unvoiced speech, it has been found that there are many
unvoiced sounds for which the normalized correlation coefficient exceeds
the 50% threshold. In order to correctly classify those voiced sounds
corresponding to a pitch in transition (causing a broken pitch track and
a large normalized correlation coefficient) another test on the spectral
energy balance is used. In this case if $r_1/r_0 > .60$, which corresponds
to the requirement that the low pass energy be four times larger than
the high pass energy, then a voiced speech classification is made.
Failing this test results in an unvoiced speech classification.
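
The branch and Tests 4 through 7 can be summarized by the sketch below. It covers only the frames that reach the branch point, that is, those not already classified by the earlier tests. Two points are assumptions rather than statements from the text: the outcome when the pitch track is unbroken but both Test 4 and Test 5 fail (taken here to be unvoiced), and the argument `lastmx_norm`, which stands for the normalized value of LASTMX used in (29). All names are hypothetical.

```python
def classify_at_branch(r1_over_r0, luna, ltrk, lastmx_norm):
    """Voicing decision for a frame that reaches the pitch continuity branch."""
    broken = (luna == 0 and ltrk == 0)

    if not broken:
        # Test 4: relaxed spectral balance threshold for an unbroken track.
        if r1_over_r0 > 0.2:
            return "voiced"
        # Test 5: a "smooth" pitch track over the previous two frames.
        if luna == 2 or ltrk == 2:
            return "voiced"
        # ASSUMPTION: the text does not state the outcome for this case.
        return "unvoiced"

    # Test 6: a broken track with a low normalized likelihood (plosive-like).
    if lastmx_norm < 0.5:
        return "unvoiced"
    # Test 7: pitch in transition, accepted only with strong low band energy.
    return "voiced" if r1_over_r0 > 0.6 else "unvoiced"


decision = classify_at_branch(r1_over_r0=0.7, luna=0, ltrk=0, lastmx_norm=0.8)
```

In the usage line a broken pitch track with a high normalized likelihood and a strong low band is treated as a pitch in transition and classified as voiced.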

Background Noise Energy Measurement: In order to complete the
specification of the classifier algorithm it is necessary to describe
the method for computing the average background noise energy $\mu(m)$ since
this quantity was used in the very first classifier test. Basically the
averaging is done using the first order recursion

$$\mu(m) = \mu(m-1) + a(m)\,[e(m) - \mu(m-1)] \tag{30}$$

where $\mu(m)$ is the average energy and e(m) is the measured energy for the
m'th frame computed according to (17). A meaningful estimate for the
background noise can be obtained only if (30) is not updated during
steady voicing. Otherwise the detection threshold would rise resulting
in erroneous unvoiced speech classifications. One way of reducing this
problem [15] is to choose a(m) adaptively to correspond to a 4-second
time constant if the energy is increasing (i.e., if $e(m) > \mu(m-1)$) and to
a 40 ms time constant if the energy is decreasing (i.e., if $e(m) < \mu(m-1)$).
In this way $\mu(m)$ adapts to the noise energy almost instantly whenever
there is a voiced speech gap (which occurs about 50% of the time), while
during connected speech the noise level increases slowly enough that the
noise power does not take on the attributes of speech. (The growth rate
is also clamped so that only a 25% increase is allowed in a 21 ms
frame.) Additional restrictions on when the average energy can be
updated are given in the flow chart of Fig. 7.
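
A minimal sketch of the recursion (30) with the asymmetric time constants and the growth clamp is given below. The conversion from a time constant to the gain a(m), taken here as 1 - exp(-T/tc) for a frame of duration T, is an assumption, as are the function names. The `update_allowed` flag stands in for the restrictions of Fig. 7.

```python
import math

FRAME_MS = 21.0        # frame duration quoted in the text
RISE_TC_MS = 4000.0    # time constant when the energy is increasing
FALL_TC_MS = 40.0      # time constant when the energy is decreasing
MAX_GROWTH = 1.25      # at most a 25% increase per frame


def smoothing_gain(time_constant_ms, frame_ms=FRAME_MS):
    """ASSUMPTION: gain of a first order recursion with the given time constant."""
    return 1.0 - math.exp(-frame_ms / time_constant_ms)


def update_noise(mu_prev, e, update_allowed=True):
    """One step of (30); mu_prev should be initialized to a positive value."""
    if not update_allowed:
        return mu_prev  # equivalent to setting a(m) = 0
    tc = RISE_TC_MS if e > mu_prev else FALL_TC_MS
    mu = mu_prev + smoothing_gain(tc) * (e - mu_prev)
    return min(mu, MAX_GROWTH * mu_prev)


mu = 100.0
for energy in (10.0, 10.0, 10.0):
    mu = update_noise(mu, energy)  # decays quickly toward the lower energy
```

With these values the downward gain is roughly 0.4 per frame while the upward gain is roughly 0.005, which reflects the fast decay and slow rise described above.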

Test 1: Since the speech data has been preprocessed by a noise
suppression filter the residual noise level is small enough that the
overflow bit in the energy computation (17) is never set. Therefore if
an overflow occurs it must be due to the presence of speech, hence the
computation of $\mu(m)$ is bypassed (i.e., a(m) is set to zero).

Test 2: Since all voiced speech sounds result in a likelihood
function having at least one zero crossing, the lack of a zero crossing
can only correspond to unvoiced speech or noise, hence the average
energy is updated.

Test 3: Although there are instances where the normalized likelihood
function falls below the 50% threshold for voiced sounds, "most of the
time" such a condition corresponds to unvoiced speech or noise, and the
average energy is updated.

Test 4: Since pitch continuity is a strong indication of voicing,
(30) is not updated whenever LUNA = 2 or LTRK = 2. Although this test
may fail on occasion, it will not significantly affect the estimate of
the background noise energy.

Fig. 7. Logic for computing the average background noise level.

V.  EXPERIMENTAL  RESULTS  AND  CONCLUSIONS 

The  preceding  classification  algorithm  has  been  evaluated  using 
the  real  time  program  in  conjunction  with  an  LPC  analysis/synthesis 
system  operating  at  2400  bps  and  3600  bps.  Literally  hours  of  speech 
data  have  been  processed  encompassing  male  and  female  speakers,  airborne 
command post and helicopter noise environments and dynamic and
noise-cancelling microphones. Subjectively the algorithm seems to be quite
robust.  The  major  weakness  is  the  tendency  to  classify  plosives  as 
voiced  speech,  an  effect  which  is  perceptible  mainly  for  high  quality, 
clean,  female  speech. 

For speech corrupted by additive E4A Advanced Airborne Command Post
noise, the buzz-hiss detector tends to classify the noise as voiced
speech which results in an unpleasant buzzy quality in the speech
synthesis. When the noisy data is processed by the noise suppression
filter, with a small amount of suppression, the noise is classified as
hiss and is more pleasant to the ear. As more suppression is introduced
the noise can be made imperceptible but at the expense of buzz-hiss
errors especially for the vowel /i/ as in "eve". It is conjectured that
this is due to the fact that the first formant lies in the region about
270 Hz where there is also a large concentration of noise energy. The
prefilter acts to suppress the noise at this frequency and in so doing
also suppresses the speech signal (the amount of suppression depends on
the SNR). This means that the pitch estimate and the voicing decision
depend on the second formant waveform which lies in the vicinity of
2290 Hz. Since this is outside the range of the lowpass 2:1 downsampling
filter there is little speech energy left to combat the residual noise.

A potential solution to this problem is to compute the pitch estimate on
the basis of the full bandwidth speech. Unfortunately this exceeds the
computational capabilities of the LDVT when the LPC algorithms are
implemented in a full duplex mode. In an attempt to determine the value
of such a wideband pitch estimator a half duplex version of the algorithm
is currently under development.
APPENDIX

DERIVATION  OF  THE  PITCH  STATISTIC 

Assuming that the pitch estimate $\hat{\tau}$ is "close to" the true pitch
period then, as a consequence of the innovations theorem [7], the residual
sequence $\epsilon_n = y_n - \hat{y}_{n|n-1}$ is a white noise process. This
means that the transformation from $\{y_n\}$ to $\{\epsilon_n\}$ is a linear
whitening filter. Letting the mapping be characterized by the transfer
function $H(\omega;\hat{\tau})$ it must necessarily follow that*

$$|H(\omega;\hat{\tau})|^2 = \frac{\gamma}{S_y(\omega;\hat{\tau})} \tag{A-1}$$


where the constant $\gamma$ represents the power spectrum of the residual
sequence and

$$S_y(\omega;\hat{\tau}) = S_v(\omega;\hat{\tau}) + \sigma_w^2 \tag{A-2}$$


represents the power spectrum of the ensemble of noisy voiced speech
waveforms. Since the constant $\gamma$ in (A-1) affects only the gain of the
filter, then with no loss $\gamma$ can be set equal to $\sigma_w^2$ which leads
to the expression

$$|H(\omega;\hat{\tau})|^2 = 1 - \frac{S_v(\omega;\hat{\tau})}{S_v(\omega;\hat{\tau}) + \sigma_w^2} \tag{A-3}$$

*To be precise, this should be written $H[\exp(j\omega)]$ where $\omega = 2\pi f/f_s$.

The corresponding minimum value for the prediction residual energy
follows from Parseval's Theorem. It is

$$\begin{aligned}
\sum_{n} \left(y_n - \hat{y}_{n|n-1}\right)^2
  &= \int_0^{\pi} |H(\omega;\hat{\tau})|^2 \, |Y(\omega)|^2 \, d\omega \\
  &= \int_0^{\pi} |Y(\omega)|^2 \, d\omega
     - \int_0^{\pi} \frac{S_v(\omega;\hat{\tau})}{S_v(\omega;\hat{\tau}) + \sigma_w^2} \, |Y(\omega)|^2 \, d\omega
\end{aligned} \tag{A-4}$$

where $Y(\omega)$ is the DFT of the measurement sequence. This result shows
that if the voiced speech spectrum is completely known except for the
pitch period $\tau$, then the value of $\tau$ for which

$$\ell(\tau) = \int_0^{\pi} \frac{S_v(\omega;\tau)}{S_v(\omega;\tau) + \sigma_w^2} \, |Y(\omega)|^2 \, d\omega \tag{A-5}$$



is  a  maximum  is  the  maximum  likelihood  estimate.  A  good  model  for  the 
voiced  speech  spectrum  is 


$$S_v(\omega;\tau) = E(\omega)\,C(\omega;\tau) \tag{A-6}$$

where $E(\omega)$ represents the envelope of the speech spectrum (the DFT of
the correlation function $R_y(t)$) which modulates the periodic line structure
represented by $C(\omega;\tau)$. Maximizing (A-5) is therefore equivalent to
maximizing

$$\ell(\tau) = \int_0^{\pi} |G(\omega;\tau)|^2 \, |Y(\omega)|^2 \, d\omega \tag{A-7}$$



where $G(\omega;\tau)$ represents the linear filter which satisfies

$$|G(\omega;\tau)|^2 = \frac{E(\omega)\,C(\omega;\tau)}{E(\omega)\,C(\omega;\tau) + \sigma_w^2} \tag{A-8}$$



Therefore the likelihood function represents the energy at the output of
a bank of filters, each being tuned to a different pitch period. The
maximum likelihood estimate of $\tau$ corresponds to that filter in the bank
for which the output energy is largest. Since the filters defined by
(A-8) pass those frequencies at which the SNR is high and reject those
at which it is low, then the effect of the comb filter in the denominator
of (A-8) is to introduce nulls between the pitch harmonics which contribute
to the definition of those frequencies at which signal rejection occurs.
Since the nulls also appear in the numerator, approximately the same
filtering performance can be obtained if the comb filter in the denominator
is omitted. As a result (A-8) can be approximated by

$$|G(\omega;\tau)|^2 = C(\omega;\tau) \cdot \frac{E(\omega)}{E(\omega) + \sigma_w^2} \tag{A-9}$$

The second term in (A-9) can be interpreted as a filtering operation by
defining

$$|P(\omega)|^2 = \frac{E(\omega)}{E(\omega) + \sigma_w^2} \tag{A-10}$$

Its function is to use the information in the spectral envelope to pass
those frequencies for which the speech SNR is large and reject those for
which it is small. Therefore $P(\omega)$ can be interpreted as a noise
suppression prefilter. An adaptive algorithm for implementing such a
prefilter has been described in detail in reference [8]. Letting the
output of this prefilter be denoted by

$$X(\omega) = P(\omega)\,Y(\omega) \tag{A-11}$$



then the pitch likelihood function reduces to the correlation operation

$$\ell(\tau) = \int_0^{\pi} C(\omega;\tau)\, X(\omega)\, X^*(\omega)\, d\omega \tag{A-12}$$
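
The behavior of the prefilter magnitude in (A-10) can be illustrated with a short sketch. It evaluates the gain at a single frequency from assumed values of the envelope $E(\omega)$ and the noise variance $\sigma_w^2$. The function name is hypothetical, a zero-phase response is assumed because (A-10) specifies only the magnitude, and this is not the adaptive algorithm of reference [8].

```python
import math


def prefilter_gain(envelope, noise_var):
    """Magnitude |P| from (A-10) at one frequency; noise_var must be positive."""
    return math.sqrt(envelope / (envelope + noise_var))


# Hypothetical envelope values at three frequencies, with unit noise variance.
noise_var = 1.0
envelope = {"high SNR band": 100.0, "equal SNR band": 1.0, "low SNR band": 0.01}
gains = {name: prefilter_gain(e, noise_var) for name, e in envelope.items()}
```

The gain approaches one where the envelope dominates the noise and approaches zero where the noise dominates, which is the pass and reject behavior described above.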


Since the function $C(\omega;\tau)$ was defined to represent the discrete periodic
structure of the speech spectrum, it must be symmetric, non-negative and
periodic. In the ideal case a suitable representation is

$$C(\omega;\tau) = \sum_{m=-\infty}^{\infty} \delta\!\left(\omega - \frac{2\pi m}{\tau}\right) \tag{A-13}$$

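Since speech is at most quasi-periodic, a more practical choice for
$C(\omega;\tau)$ is (other choices are possible)

$$C(\omega;\tau) = \frac{1}{2}\,(1 + \cos \omega\tau) \tag{A-14}$$

which is the DFT of the sequence

$$c_n = \frac{1}{2}\,\delta_n + \frac{1}{4}\,\delta_{n-\tau} + \frac{1}{4}\,\delta_{n+\tau} \tag{A-15}$$
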
hence the likelihood ratio can be written as

$$\ell(\tau) = \frac{1}{2} \sum_{n=1}^{N} x_n \left[ x_n + \frac{1}{2}\,(x_{n-\tau} + x_{n+\tau}) \right] \tag{A-16}$$



Since only the second term depends on the pitch period, then the final
expression for the likelihood function is

$$\ell(\tau) = \sum_{n=1}^{N} x_n\, \tilde{x}_n(\tau) \tag{A-17}$$

where

$$\tilde{x}_n(\tau) = \frac{1}{2}\,(x_{n-\tau} + x_{n+\tau}) \tag{A-18}$$

represents the output of a comb filter tuned to pitch period $\tau$.

Of  course  a  bank  of  comb  filters  is  needed,  each  tuned  to  a  different 
pitch  period,  and  the  one  that  leads  to  the  largest  correlation  determines 
the  maximum  likelihood  estimate  of  the  pitch  period. 
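
A direct time domain evaluation of (A-17) and (A-18) over a bank of candidate pitch periods can be sketched as follows. This is a minimal illustration rather than the real time implementation: samples outside the analysis frame are assumed to be zero, the input is assumed to be already prefiltered, the candidate range and the synthetic test signal are arbitrary, and the function names are hypothetical. The majority logic and the pitch tracker are not included.

```python
import math


def comb_output(x, n, tau):
    """Comb filter output of (A-18) at sample n (zero outside the frame)."""
    left = x[n - tau] if n - tau >= 0 else 0.0
    right = x[n + tau] if n + tau < len(x) else 0.0
    return 0.5 * (left + right)


def likelihood(x, tau):
    """Pitch likelihood of (A-17): correlation of x with the comb output."""
    return sum(x[n] * comb_output(x, n, tau) for n in range(len(x)))


def estimate_pitch(x, tau_min, tau_max):
    """Candidate pitch period whose comb filter gives the largest correlation."""
    return max(range(tau_min, tau_max + 1), key=lambda tau: likelihood(x, tau))


# Synthetic periodic frame: fundamental plus one harmonic, period 40 samples.
period = 40
frame = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
         for n in range(400)]
pitch = estimate_pitch(frame, 20, 100)
```

Because out-of-frame samples are treated as zero, longer candidate periods use fewer overlapping samples, so in this sketch multiples of the true period score lower than the period itself.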